{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 一.特征工程\n",
    "特征工程是比较重要的一环了，这一节先只介绍一些常用的特征工程手段，更多关于特征的操作放到后面模型优化中讲，因为业务建模通常需要尽快有一个快速的输出，而各种各样特征工程的探索一般比较耗时，往往在后期优化中进行，下面分别介绍如下几种：   \n",
    "\n",
    "（1）categorical 变量的处理；   \n",
    "\n",
    "（2）数据分箱；   \n",
    "\n",
    "（3）WOE特征； \n",
    "\n",
    "（4）数据归一化操作(z-score,min-max,normalize,log,boxcox,cdf,pdf变换等)；   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### categorical 变量的处理\n",
    "对于一般的机器学习模型，往往只支持数值型输入（lgb,catboost除外），所以第一步就是要将非数值型特征转换为数值型特征，这里我们可以按之前介绍的变量功能进行处理：    \n",
    "\n",
    "（1）对于可比较的特征，比如“高”、“中”、“低”这类，可以考虑LabelEncoding，将其编码为可比较的数值，比如3,2,1;    \n",
    "\n",
    "（2）对于不可比较的特征，比如行为爱好类：“游戏”，“篮球”，“跑步”....它们没法直接进行比较，所以对其编码3,2,1没有比较意义，对于线性模型这样与特征起着正相关性/负相关性的模型会有副作用，这时我们可以分两类看：   \n",
    "\n",
    "> （2.1）low-cardinality categorical特征：即分类数较少的特征，可以考虑作one-hot一类的特征转换；   \n",
    "\n",
    "> （2.2）high-cardinality categorical 特征：即分类数较多的特征，比如id,时间这样的特征，如果作ont-hot展开，会引起维度爆炸，而且对单个特征会有严重的数据不平衡问题，这时可以考虑使用target encoding,catboost encoding等特征转换  \n",
    "\n",
    "我对这一块的了解并不多，建议参考:http://contrib.scikit-learn.org/categorical-encoding/  \n",
    "\n",
    "画根线：  \n",
    "\n",
    "**对于树模型，使用one-hot编码要慎重，如果选择one-hot特征作切分可能会导致切分后的左右子树极端不平衡，对于样本量少的一边，其统计值（均值、众数）偏差往往很大，模型容易过拟合**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split,cross_val_score,KFold\n",
    "from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "train_df=pd.read_csv('./titanic/train.csv')\n",
    "test_df=pd.read_csv('./titanic/test.csv')\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df['Cabin'].fillna('missing',inplace=True)\n",
    "test_df['Cabin'].fillna('missing',inplace=True)\n",
    "train_df['Embarked'].fillna(train_df['Embarked'].mode()[0],inplace=True)\n",
    "train_df['Age'].fillna(train_df['Age'].mean(),inplace=True)\n",
    "test_df['Age'].fillna(train_df['Age'].mean(),inplace=True)\n",
    "test_df['Fare'].fillna(train_df['Fare'].mean(),inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import category_encoders as ce\n",
    "del train_df['Name']\n",
    "del train_df['Ticket']\n",
    "del test_df['Name']\n",
    "del test_df['Ticket']\n",
    "del train_df['PassengerId']\n",
    "del test_df['PassengerId']\n",
    "label=train_df[\"Survived\"]\n",
    "del train_df[\"Survived\"]\n",
    "# target \n",
    "target_encoder = ce.TargetEncoder(cols=['Embarked','Cabin']).fit(train_df,label)\n",
    "train_df=target_encoder.transform(train_df)\n",
    "test_df=target_encoder.transform(test_df)\n",
    "\n",
    "# one hot\n",
    "onehot_encoder = ce.OneHotEncoder(cols=['Sex']).fit(train_df)\n",
    "train_df=onehot_encoder.transform(train_df)\n",
    "test_df=onehot_encoder.transform(test_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 数据分箱\n",
    "数据分箱也是比较常见的操作，在一定程度上增加了数据的颗粒度，能避免过拟合，分箱的方式很多按照有监督和无监督可以分为：   \n",
    "\n",
    "（1）无监督：等距分箱，等频分箱，聚类分箱（后面会有用到），二值分箱；   \n",
    "\n",
    "（2）有监督：卡方分箱，最小熵分箱   \n",
    "\n",
    "[参考>>>](https://mp.weixin.qq.com/s?__biz=MzI4MjkzNTUxMw==&mid=2247484834&idx=1&sn=2ce321c5fa5e8df3b6d3c6b644aa141b&chksm=eb932c14dce4a502e05421a478796d9e2b8522dc583a609c77e2883b963c7670a5826c0a0c03&mpshare=1&scene=1&srcid=&sharer_sharetime=1586312923627&sharer_shareid=1e2e4bc7a27ef84a8a7e8bfedd080e02&key=a29eb00479c14a8ba349d29b5b684126f11bdaa6027411a066240c0a3c5e8b661e1466038e1df5822afaab891b3dff9677f96ffe7fdbefbf9884896a317d6f2257c8ed823eff434953cedbe74113d645&ascene=1&uin=MTk4MzMwODY4Mg%3D%3D&devicetype=Windows+10&version=62080079&lang=zh_CN&exportkey=A0CyFLsd47G%2BXNIR%2FFB6kfk%3D&pass_ticket=rfTc5pQJI1Ast4NBu%2BmXt3UDs%2FbiY3EnhijxRpOOGWVafOmugaq8EmMUfGDYL1OL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "#对Fare等频分箱\n",
    "train_df['Fare_bins'],bins=pd.cut(train_df['Fare'],bins=10,labels=False,retbins=True)\n",
    "test_df['Fare_bins']=pd.cut(test_df['Fare'],bins=bins,labels=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### WOE\n",
    "WOE特征可以非常直观的表达当前分箱后的特征取值对y的贡献大小，它的计算公式：   \n",
    "\n",
    "$$\n",
    "WOE(feature_i)=ln(\\frac{N(y_{feature_i}=1)}{N(y_{feature_i}=0)})\n",
    "$$\n",
    "\n",
    "$feature_i$表示$feature$的第$i$个取值，$N(y_{feature_i}=1)$表示特征$feature$的第$i$个取值对应的$y=1$的数量\n",
    "\n",
    "参考:https://zhuanlan.zhihu.com/p/30026040"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#计算训练集上的woe\n",
    "train_df['y']=label\n",
    "pos_df=(train_df[['Fare_bins','y']][train_df['y']==1]).groupby(by=['Fare_bins'])['y'].count().reset_index()\n",
    "pos_df.columns=['Fare_bins','pos']\n",
    "neg_df=(train_df[['Fare_bins','y']][train_df['y']==0]).groupby(by=['Fare_bins'])['y'].count().reset_index()\n",
    "neg_df.columns=['Fare_bins','neg']\n",
    "woe_df=pd.merge(pos_df,neg_df,on='Fare_bins')\n",
    "woe_df['woe']=np.log(woe_df['pos']/(1e-3+woe_df['neg'])+1e-5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Fare_bins</th>\n",
       "      <th>woe</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>-0.761548</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0.664954</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>1.055931</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>0.559372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>0.692652</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Fare_bins       woe\n",
       "0          0 -0.761548\n",
       "1          1  0.664954\n",
       "2          2  1.055931\n",
       "3          4  0.559372\n",
       "4          5  0.692652"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "woe_df[['Fare_bins','woe']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "del train_df['y']\n",
    "train_df=pd.merge(train_df,woe_df[['Fare_bins','woe']],how='left').fillna(0)\n",
    "test_df=pd.merge(test_df,woe_df[['Fare_bins','woe']],how='left').fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 数据归一化\n",
    "数据是否需要标准化取决于后面的训练模型，对于树模型不需要归一化，而对于如下的模型往往需要归一化：   \n",
    "\n",
    "（1）knn,kmeans：计算欧氏距离...  \n",
    "（2）lr,svm,nn：便于梯度下降...  \n",
    "（3）pca：偏向于值较大的列  \n",
    "\n",
    "\n",
    "一般有以下几种：  \n",
    "（1）z-score：$z=(x-u)/\\sigma$  \n",
    "（2）min-max：$m=(x-x_{min})/(x_{max}-x_{min})$  \n",
    "（3）行归一化：$\\sqrt{(x_1^2+x_2^2+\\cdots+x_n^2)}=1,这里x_1,x_2,...,x_n表示每行特征$；  \n",
    "\n",
    "更多：https://www.jianshu.com/p/fa73a07cd750"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler,MinMaxScaler,Normalizer\n",
    "#z-score归一化为例\n",
    "standard_scaler=StandardScaler()\n",
    "standard_scaler.fit(train_df)\n",
    "new_train_df=pd.DataFrame(standard_scaler.transform(train_df),columns=train_df.columns)\n",
    "new_test_df=pd.DataFrame(standard_scaler.transform(test_df),columns=train_df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pdf:标准正态分布的概率密度函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x1807a920080>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAD8CAYAAABthzNFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFaNJREFUeJzt3X/wZXV93/HnywX5YYiAfLV0F1xINj/QUTBf0alNo2ArYnWxlRSbRkJpNiY40THTCupU05YZ7MRgnBrMGqyLjUHUKFvEpAuCjjMFXHDlh0hZgcq6O7ABBBGFgu/+cT9fvSyH/d7v8j3fc9l9Pmbu3HM+5/O5982F+31xzufcc1JVSJK0o2cMXYAkaToZEJKkTgaEJKmTASFJ6mRASJI6GRCSpE4GhCSpkwEhSepkQEiSOu01dAFPxSGHHFIrV64cugxJelq59tpr/76qZubr97QOiJUrV7Jx48ahy5Ckp5Uk/3eSfh5ikiR1MiAkSZ0MCElSJwNCktTJgJAkdTIgJEmdDAhJUicDQpLUyYCQJHV6Wv+SWpKGtPLMLw723nec87re38M9CElSJwNCktTJgJAkdTIgJEmdDAhJUicDQpLUyYCQJHUyICRJnXoPiCTLknwjySVt/YgkVye5Ncmnkzyzte/T1je37Sv7rk2S9OSWYg/i7cDNY+sfAM6tqlXAfcDprf104L6q+kXg3NZPkjSQXgMiyQrgdcBftvUAxwGfbV3WASe15dVtnbb9+NZfkjSAvvcgPgT8B+Anbf05wPer6tG2vgVY3paXA3cCtO33t/6SpAH0FhBJ/jlwd1VdO97c0bUm2Db+umuSbEyycfv27YtQqSSpS597EK8A3pDkDuBCRoeWPgQcmGTuKrIrgK1teQtwGEDb/mzg3h1ftKrWVtVsVc3OzMz0WL4k7dl6C4iqOquqVlTVSuAU4MtV9VvAFcCbWrdTgYvb8vq2Ttv+5ap6wh6EJGlpDPE7iHcB70yymdEcw/mt/XzgOa39ncCZA9QmSWqW5IZBVXUlcGVbvg04tqPPj4GTl6IeSdL8/CW1JKmTASFJ6mRASJI6GRCSpE4GhCSpkwEhSepkQEiSOhkQkqROBoQkqZMBIUnqZEBIkjoZEJKkTgaEJKmTASFJ6mRASJI69XlP6n2TXJPkm0luSvLHrf0TSW5Psqk9jm7tSfLhJJuTXJ/kJX3VJkmaX583DHoYOK6qHkyyN/C1JF9q2/59VX12h/6vBVa1x8uA89qzJGkAfd6Tuqrqwba6d3vs7B7Tq4EL2rirgAOTHNpXfZKknet1DiLJsiSbgLuBDVV1ddt0djuMdG6SfVrbcuDOseFbWpskaQC9BkRVPVZVRwMrgGOTvBA4C/gV4KXAwcC7Wvd0vcSODUnWJNmYZOP27dt7qlyStCRnMVXV94ErgROqals7jPQw8N+BY1u3LcBhY8NWAFs7XmttVc1W1ezMzEzPlUvSnqvPs5hmkhzYlvcDXg18e25eIUmAk4Ab25D1wFva2UwvB+6vqm191SdJ2rk+z2I6FFiXZBmjILqoqi5J8uUkM4wOKW0C3tr6XwqcCGwGHgJO67E2SdI8eguIqroeOKaj/bgn6V/AGX3VI0laGH9JLUnqZEBIkjoZEJKkTgaEJKmTASFJ6mRASJI6GRCSpE4GhCSpkwEhSepkQEiSOhkQkqROBoQkqZMBIUnqZEBIkjoZEJKkTgaEJKlTn7cc3TfJNUm+meSmJH/c2o9IcnWSW5N8OskzW/s+bX1z276yr9okSfPrcw/iYeC4qnoxcDRwQrvX9AeAc6tqFXAfcHrrfzpwX1X9InBu6ydJGkhvAVEjD7bVvdujgOOAz7b2dcBJbXl1W6dtPz5J+qpPkrRzvc5BJFmWZBNwN7AB+A7w/ap6tHXZAixvy8uBOwHa9vuB53S85pokG5Ns3L59e5/lS9IerdeAqKrHqupoYAVwLPCrXd3ac9feQj2hoWptVc1W1ezMzMziFStJepwlOYupqr4PXAm8HDgwyV5t0wpga1veAhwG0LY/G7h3KeqTJD1Rn2cxzSQ5sC3vB7wauBm4AnhT63YqcHFbXt/Wadu/XFVP2IOQJC2NvebvsssOBdYlWcYoiC6qqkuSfAu4MMl/Ab4BnN/6nw98MslmRnsOp/RYmyRpHr0FRFVdDxzT0X4bo/mIHdt/DJzcVz2SpIXxl9SSpE4GhCSpkwEhSepkQEiSOhkQkqROBoQkqZMBIUnqNFFAJHlh34VIkqbLpHsQH203//mDuctnSJJ2bxMFRFX9Y+C3GF1Mb2OSTyX5p71WJkka1MRzEFV1K/Be4F3AbwAfTvLtJP+ir+IkScOZdA7iRUnOZXQ11uOA11fVr7blc3usT5I0kEkv1vffgI8B766qH801VtXWJO/tpTJJ0qAmDYgTgR9V1WMASZ4B7FtVD1XVJ3urTpI0mEnnIC4D9htb37+1SZJ2U5MGxL5V9eDcSlvef2cDkhyW5IokNye5KcnbW/v7k3wvyab2OHFszFlJNie5JclrduUfSJK0OCY9xPTDJC+pqusAkvwa8KN5xjwK/FFVXZfkAODaJBvatnOr6k/GOyc5itFd5F4A/EPgsiS/NHdYS5K0tCYNiHcAn0myta0fCvyrnQ2oqm3Atrb8gyQ3A8t3MmQ1cGFVPQzc3m49eizwvyesUZK0iCYKiKr6epJfAX4ZCPDtqvp/k75JkpWMbj96NfAK4G1J3gJsZLSXcR+j8LhqbNgWdh4okqQeLeRifS8FXsToD/2b2x/4eSX5OeBzwDuq6gHgPOAXgKMZ7WF8cK5rx/DqeL01STYm2bh9+/YFlC9JWoiJ9iCSfJLRH/VNwNycQAEXzDNub0bh8FdV9TcAVXXX2PaPAZe01S2MLuUxZwWwlR1U1VpgLcDs7OwTAkSStDgmnYOYBY6qqon/ICcJcD5wc1X96Vj7oW1+AuCNwI1teT3wqSR/ymiSehVwzaTvJ0laXJMGxI3AP6BNOk/oFcBvAzck2dTa3s3o8NTRjPZA7gB+D6CqbkpyEfAtRmdAneEZTJI0nEkD4hDgW0muAR6ea6yqNzzZgKr6Gt3zCpfuZMzZwNkT1iRJ6tGkAfH+PouQJE2fSU9z/UqS5wOrquqyJPsDy/otTZI0pEkv9/27wGeBv2hNy4Ev9FWUJGl4k/4O4gxGk84PwE9vHvTcvoqSJA1v0oB4uKoemVtJshcdP2KTJO0+Jg2IryR5N7Bfuxf1Z4D/2V9ZkqShTRoQZwLbgRsY/W7hUkb3p5Yk7aYmPYvpJ4xuOfqxfsuRJE2LSa/FdDsdcw5VdeSiVyRJmgoLuRbTnH2Bk4GDF78cSdK0mGgOoqruGXt8r6o+BBzXc22SpAFNeojpJWOrz2C0R3FALxVJkqbCpIeYPji2/Cijq7D+5qJXI0maGpOexfSqvguRJE2XSQ8xvXNn28dvCCRJ2j1M+kO5WeD3GV2kbznwVuAoRvMQnXMRSQ5LckWSm5PclOTtrf3gJBuS3NqeD2rtSfLhJJuTXL/DvIckaYkt5IZBL6mqHwAkeT/wmar6dzsZ8yjwR1V1XZIDgGuTbAB+B7i8qs5JciajX2m/C3gto9uMrgJeBpzXniVJA5h0D+Jw4JGx9UeAlTsbUFXbquq6tvwD4GZGex+rgXWt2zrgpLa8GrigRq4CDkxy6IT1SZIW2aR7EJ8ErknyeUa/qH4jcMGkb5JkJXAMcDXwvKraBqMQSTJ32fDlwJ1jw7a0toXcB1uStEgmPYvp7CRfAn69NZ1WVd+YZGySnwM+B7yjqh5Ium5TPera9dYdr7cGWANw+OGHT1KCJGkXTHqICWB/4IGq+jNgS5Ij5huQZG9G4fBXVfU3rfmuuUNH7fnu1r4FOGxs+Apg646vWVVrq2q2qmZnZmYWUL4kaSEmveXo+xhNJJ/VmvYG/sc8YwKcD9y8w2mw64FT2/KpwMVj7W9pZzO9HLh/7lCUJGnpTToH8UZGcwhzk85b25lJO/MK4LeBG5Jsam3vBs4BLkpyOvBdRhf+g9E9Jk4ENgMPAadN+g8hSVp8kwbEI1VVSQogybPmG1BVX6N7XgHg+I7+xeje15KkKTDpHMRFSf6C0amnvwtchjcPkqTd2qRnMf1Juxf1A8AvA/+xqjb0WpkkaVDzBkSSZcDfVdWrAUNBkvYQ8x5iqqrHgIeSPHsJ6pEkTYlJJ6l/zOhspA3AD+caq+oPe6lKkjS4SQPii+0hSdpD7DQgkhxeVd+tqnU76ydJ2v3MNwfxhbmFJJ/ruRZJ0hSZLyDGf+h2ZJ+FSJKmy3wBUU+yLEnazc03Sf3iJA8w2pPYry3T1quqfr7X6iRJg9lpQFTVsqUqRJI0XRZyPwhJ0h7EgJAkdTIgJEmdeguIJB9PcneSG8fa3p/ke0k2tceJY9vOSrI5yS1JXtNXXZKkyfS5B/EJ4ISO9nOr6uj2uBQgyVHAKcAL2pg/b1eRlSQNpLeAqKqvAvdO2H01cGFVPVxVtzO67eixfdUmSZrfEHMQb0tyfTsEdVBrWw7cOdZnS2uTJA1kqQPiPOAXgKOBbcAHW3vXvas7f7mdZE2SjUk2bt++vZ8qJUlLGxBVdVdVPVZVP2F0T+u5w0hbgMPGuq4Atj7Ja6ytqtmqmp2Zmem3YEnagy1pQCQ5dGz1jcDcGU7rgVOS7JPkCGAVcM1S1iZJerxJbxi0YEn+GnglcEiSLcD7gFcmOZrR4aM7gN8DqKqbklwEfAt4FDij3epUkjSQ3gKiqt7c0Xz+TvqfDZzdVz2SpIXxl9SSpE4GhCSpkwEhSepkQEiSOhkQkqROBoQkqZMBIUnqZEBIkjoZEJKkTgaEJKmTASFJ6mRASJI6GRCSpE4GhCSpkwEhSepkQEiSOvUWEEk+nuTuJDeOtR2cZEOSW9vzQa09ST6cZHOS65O8pK+6JEmT6XMP4hPACTu0nQlcXlWrgMvbOsBrGd2HehWwBjivx7okSRPoLSCq6qvAvTs0rwbWteV1wElj7RfUyFXAgUkO7as2SdL8lnoO4nlVtQ2gPT+3tS8H7hzrt6W1PUGSNUk2Jtm4ffv2XouVpD3ZtExSp6OtujpW1dqqmq2q2ZmZmZ7LkqQ911IHxF1zh47a892tfQtw2Fi/FcDWJa5NkjRmryV+v/XAqcA57fnisfa3JbkQeBlw/9yhKElPDyvP/OJg733HOa8b7L13Z70FRJK/Bl4JHJJkC/A+RsFwUZLTge8CJ7fulwInApuBh4DT+qpLkjSZ3gKiqt78JJuO7+hbwBl91SJJWrhpmaSWJE0ZA0KS1MmAkCR1WuqzmCRp0Q15BtXuzD0ISVInA0KS1MmAkCR1MiAkSZ0MCElSJwNCktTJgJAkdTIgJEmdDAhJUicDQpLUyYCQJHUa5FpMSe4AfgA8BjxaVbNJDgY+DawE7gB+s6ruG6I+6enM6xJpsQy5B/Gqqjq6qmbb+pnA5VW1Cri8rUuSBjJNh5hWA+va8jrgpAFrkaQ93lABUcD/SnJtkjWt7XlVtQ2gPT93oNokSQx3P4hXVNXWJM8FNiT59qQDW6CsATj88MP7qq9XQx0jvuOc1w3yvnsq5wL0dDdIQFTV1vZ8d5LPA8cCdyU5tKq2JTkUuPtJxq4F1gLMzs7Wrtbgl1eSdm7JDzEleVaSA+aWgX8G3AisB05t3U4FLl7q2iRJPzPEHsTzgM8nmXv/T1XV3yb5OnBRktOB7wInD1CbJKlZ8oCoqtuAF3e03wMcv9T1SJK6DTVJrQEMOe/iBLn09DNNv4OQJE0RA0KS1MmAkCR1MiAkSZ0MCElSJwNCktTJgJAkdfJ3EFoSXvtKevpxD0KS1MmAkCR1MiAkSZ0MCElSJwNCktTJgJAkdZq6gEhyQpJbkmxOcubQ9UjSnmqqAiLJMuAjwGuBo4A3Jzlq2Kokac80VQEBHAtsrqrbquoR4EJg9cA1SdIeadoCYjlw59j6ltYmSVpi03apjXS01eM6JGuANW31wSS37OJ7HQL8/S6O7dO01gXTW5t1LYx1LcxU1pUPALte2/Mn6TRtAbEFOGxsfQWwdbxDVa0F1j7VN0qysapmn+rrLLZprQumtzbrWhjrWphprQv6r23aDjF9HViV5IgkzwROAdYPXJMk7ZGmag+iqh5N8jbg74BlwMer6qaBy5KkPdJUBQRAVV0KXLoEb/WUD1P1ZFrrgumtzboWxroWZlrrgp5rS1XN30uStMeZtjkISdKU2C0DYr7LdSTZJ8mn2/ark6wc23ZWa78lyWumoa4kK5P8KMmm9vjoEtf1T5Jcl+TRJG/aYdupSW5tj1OnqK7Hxj6vRT3RYYK63pnkW0muT3J5kuePbevt81qE2ob8zN6a5Ib23l8bv4LCwN/JzrqG/k6O9XtTkkoyO9a2eJ9XVe1WD0aT298BjgSeCXwTOGqHPn8AfLQtnwJ8ui0f1frvAxzRXmfZFNS1ErhxwM9rJfAi4ALgTWPtBwO3teeD2vJBQ9fVtj044Of1KmD/tvz7Y/8ee/u8nmptU/CZ/fzY8huAv23LQ38nn6yuQb+Trd8BwFeBq4DZPj6v3XEPYpLLdawG1rXlzwLHJ0lrv7CqHq6q24HN7fWGrqtP89ZVVXdU1fXAT3YY+xpgQ1XdW1X3ARuAE6agrj5NUtcVVfVQW72K0e95oN/P66nW1qdJ6npgbPVZ/OwHsoN+J3dSV58mveTQfwb+K/DjsbZF/bx2x4CY5HIdP+1TVY8C9wPPmXDsEHUBHJHkG0m+kuTXF6mmSevqY2zfr71vko1Jrkpy0iLVtCt1nQ58aRfHLmVtMPBnluSMJN9h9EfvDxcydoC6YMDvZJJjgMOq6pKFjl2IqTvNdRHMe7mOnfSZZOyueip1bQMOr6p7kvwa8IUkL9jh/276rKuPsX2/9uFVtTXJkcCXk9xQVd9ZyrqS/BtgFviNhY7dRU+lNhj4M6uqjwAfSfKvgfcCp046doC6BvtOJnkGcC7wOwsdu1C74x7EvJfrGO+TZC/g2cC9E45d8rra7uI9AFV1LaPjir+0hHX1MbbX166qre35NuBK4JilrCvJq4H3AG+oqocXMnag2gb/zMZcCMztwUzTf2M/rWvg7+QBwAuBK5PcAbwcWN8mqhf38+pjkmXIB6O9otsYTdDMTfC8YIc+Z/D4yeCL2vILePwEz20s3oTYU6lrZq4ORhNX3wMOXqq6xvp+gidOUt/OaML1oLY8DXUdBOzTlg8BbqVjkq/Hf4/HMPqDsWqH9t4+r0WobejPbNXY8uuBjW156O/kk9U1Fd/J1v9KfjZJvaif16L8hzltD+BE4P+0L8J7Wtt/YvR/TAD7Ap9hNIFzDXDk2Nj3tHG3AK+dhrqAfwnc1P7FXwe8fonreimj/zP5IXAPcNPY2H/b6t0MnDYNdQH/CLihfV43AKcvcV2XAXcBm9pj/VJ8Xk+ltin4zP6s/Te+CbiCsT+IA38nO+sa+ju5Q98raQGx2J+Xv6SWJHXaHecgJEmLwICQJHUyICRJnQwISVInA0KS1MmAkCR1MiAkSZ0MCElSp/8PWsGS+iXmMOcAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x180687b6ef0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from scipy.stats import norm\n",
    "age_mean=train_df['Age'].mean()\n",
    "age_std=train_df['Age'].std()\n",
    "new_train_df['age_pdf']=train_df['Age'].apply(lambda x:norm.pdf((x-age_mean)/age_std))\n",
    "new_test_df['age_pdf']=test_df['Age'].apply(lambda x:norm.pdf((x-age_mean)/age_std))\n",
    "new_train_df['age_pdf'].plot(kind='hist')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cdf:分布函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_train_df['age_cdf']=train_df['Age'].apply(lambda x:norm.cdf((x-age_mean)/age_std))\n",
    "new_test_df['age_cdf']=test_df['Age'].apply(lambda x:norm.cdf((x-age_mean)/age_std))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "boxcox:将任意分别转换为正态分布\n",
    "\n",
    "$$\n",
    "y=\\left\\{\\begin{matrix}\n",
    "\\frac{(x^\\lambda-1)}{\\lambda} &\\lambda>0 \\\\ \n",
    "log(x) &\\lambda=0 \n",
    "\\end{matrix}\\right.\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x1807a8f1e10>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAD8CAYAAABthzNFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAE0BJREFUeJzt3X/wXXV95/Hny5AKVttI+cKm+bFBN22ljgb3a8qM7ixF2yJagzulC9OtjMMadxanOrWtgemuurPM6IwV63SHNhZqtFaMv0pW6boRsa5/CARFfkWHVFn4NhmSVhAoNiz43j/uyXBJPny/9xtzvvfm+30+Zu7ccz73c+558xmSV87nnHtOqgpJkg73rHEXIEmaTAaEJKnJgJAkNRkQkqQmA0KS1GRASJKaDAhJUpMBIUlqMiAkSU0njLuAH8cpp5xS69atG3cZknRcufXWW/+hqqbm6ndcB8S6devYtWvXuMuQpONKkv87Sj+nmCRJTQaEJKnJgJAkNRkQkqQmA0KS1GRASJKaDAhJUpMBIUlq6i0gkpyY5OYk30pyV5L3dO0fSfK9JLd1rw1de5J8KMmeJLcneVlftUmS5tbnL6kPAudU1aNJlgNfS/I33We/X1WfPqz/a4D13euXgKu6d+mordvyhbHt+973vnZs+5aOhd6OIGrg0W51efeqWTbZBHy02+7rwIokK/uqT5I0u17PQSRZluQ2YD+ws6pu6j66optGujLJs7u2VcD9Q5vPdG2SpDHoNSCq6smq2gCsBjYmeTFwGfALwMuBk4F3dt3T+orDG5JsTrIrya4DBw70VLkkaUGuYqqqh4CvAOdW1b5uGukg8BfAxq7bDLBmaLPVwN7Gd22tqumqmp6amvNutZKko9TnVUxTSVZ0yycBrwa+fei8QpIA5wN3dpvsAN7YXc10FvCDqtrXV32SpNn1eRXTSmBbkmUMgmh7VX0+yZeTTDGYUroN+E9d/+uB84A9wGPAm3qsTZI0h94CoqpuB85stJ/zDP0LuLSveiRJ8+MvqSVJTQaEJKnJgJAkNRkQkqQmA0KS1GRASJKaDAhJUpMBIUlqMiAkSU0GhCSpyYCQJDUZEJKkJgNCktRkQEiSmgwISVKTASFJajIgJElNBoQkqcmAkCQ1GRCSpKbeAiLJiUluTvKtJHcleU/XfnqSm5Lck+STSX6ia392t76n+3xdX7VJkubW5xHEQeCcqnopsAE4N8lZwPuAK6tqPfAgcEnX/xLgwar6V8CVXT9J0pj0FhA18Gi3urx7FXAO8OmufRtwfre8qVun+/xVSdJXfZKk2fV6DiLJsiS3AfuBncDfAQ9V1RNdlxlgVbe8CrgfoPv8B8DP9FmfJOmZ9RoQVfVkVW0AVgMbgRe1unXvraOFOrwhyeYku5LsOnDgwLErVpL0NAtyFVNVPQR8BTgLWJHkhO6j1cDebnkGWAPQff7TwPcb37W1qqaranpqaqrv0iVpyerzKqapJCu65ZOAVwO7gRuB3+i6XQxc1y3v6NbpPv9yVR1xBCFJWhgnzN3lqK0EtiVZxiCItlfV55PcDVyb5L8D3wSu7vpfDXwsyR4GRw4X9libJGkOvQVEVd0OnNlo/y6D8xGHt/8zcEFf9UiS5sdfUkuSmgwISVKTASFJajIgJElNBoQkqcmAkCQ1GRCSpCYDQpLUZEBIkpoMCElSkwEhSWoyICRJTQaEJKnJgJAkNRkQkqQmA0KS1GRASJKaDAhJUpMBIUlqMiAkSU29BUSSNUluTLI7yV1J3ta1vzvJ3ye5rXudN7TNZUn2JPlOkl/rqzZJ0txO6PG7nwDeUVXfSPI84NYkO7vPrqyq9w93TnIGcCHwi8DPAl9K8nNV9WSPNUqSnkFvRxBVta+qvtEtPwLsBlbNsskm4NqqOlhV3wP2ABv7qk+SNLsFOQeRZB1wJnBT1/TWJLcnuSbJ87u2VcD9Q5vNMHugSJJ61HtAJHku8Bng7VX1MHAV8EJgA7AP+KNDXRubV+P7NifZlWTXgQMHeqpaktRrQCRZziAcPl5VnwWoqgeq6smq+hHwYZ6aRpoB1gxtvhrYe/h3VtXWqpququmpqak+y5ekJa3Pq5gCXA3srqoPDLWvHOr2BuDObnkHcGGSZyc5HVgP3NxXfZKk2fV5FdMrgN8G7khyW9d2OXBRkg0Mpo/uBd4CUFV3JdkO3M3gCqhLvYJJksant4Coqq/RPq9w/SzbXAFc0VdNkqTR+UtqSVKTASFJajIgJElNBoQkqcmAkCQ1GRCSpCYDQpLUZEBIkppGCogkL+67EEnSZBn1COJPk9yc5D8nWdFrRZKkiTBSQFTVK4HfYnC31V1J/irJr/RamSRprEY+B1FV9wB/CLwT+LfAh5J8O8m/66s4SdL4jHoO4iVJrmTw2NBzgF+vqhd1y1f2WJ8kaUxGvZvrnzB4uM/lVfXDQ41VtTfJH/ZSmSRprEYNiPOAHx56PkOSZwEnVtVjVfWx3qqTJI3NqOcgvgScNLT+nK5NkrRIjRoQJ1bVo4dWuuXn9FOSJGkSjBoQ/5TkZYdWkvxr4Iez9JckHedGPQfxduBTSfZ26yuBf99PSZKkSTBSQFTVLUl+Afh5Bs+Z/nZV/b9eK5MkjdV8btb3cuAlwJnARUneOFvnJGuS3Jhkd5K7krytaz85yc4k93Tvz+/ak+RDSfYkuX14SkuStPBG/aHcx4D3A69kEBQvB6bn2OwJ4B3dD+rOAi5NcgawBbihqtYDN3TrAK8B1nevzcBV8/tPkSQdS6Oeg5gGzqiqGvWLq2ofsK9bfiTJbmAVsAk4u+u2DfgKg9t3bAI+2u3j60lWJFnZfY8kaYGNOsV0J/AvjnYnSdYxmJq6CTjt0F/63fupXbdVwP1Dm810bZKkMRj1COIU4O4kNwMHDzVW1evn2jDJc4HPAG+vqoeTPGPXRtsRRyxJNjOYgmLt2rVzVy5JOiqjBsS7j+bLkyxnEA4fr6rPds0PHJo6SrIS2N+1zzC4nfghq4G9HKaqtgJbAaanp0ee8pIkzc+oz4P4W+BeYHm3fAvwjdm2yeBQ4Wpgd1V9YOijHcDF3fLFwHVD7W/srmY6C/iB5x8kaXxGOoJI8mYG0zonAy9kcG7gT4FXzbLZK4DfBu5IclvXdjnwXmB7kkuA+4ALus+uZ3BTwD3AY8Cb5vVfIkk6pkadYroU2MjgJDNVdU+SU2fboKq+Rvu8AjSCpbt66dIR65Ek9WzUq5gOVtXjh1aSnEDjBLIkafEYNSD+NsnlwEnds6g/BfzP/sqSJI3bqAGxBTgA3AG8hcH5Ap8kJ0mL2Kg36/sRg0eOfrjfciRJk2LUq5i+R+OcQ1W94JhXJEmaCPO5F9MhJzK4NPXkY1+OJGlSjPpDuX8cev19VX0QOKfn2iRJYzTqFNPwsxmexeCI4nm9VCRJmgijTjH90dDyEwxuu/Gbx7waSdLEGPUqpl/uuxBJ0mQZdYrpd2f7/LCb8UmSFoH5XMX0cgZ3XAX4deCrPP0BP5KkRWQ+Dwx6WVU9ApDk3cCnquo/9lWYJGm8Rr3Vxlrg8aH1x4F1x7waSdLEGPUI4mPAzUk+x+AX1W8APtpbVZKksRv1KqYrkvwN8G+6pjdV1Tf7K0uSNG6jTjEBPAd4uKr+GJhJcnpPNUmSJsBIAZHkXcA7gcu6puXAX/ZVlCRp/EY9gngD8HrgnwCqai/eakOSFrVRA+Lx7pnRBZDkJ/srSZI0CUYNiO1J/gxYkeTNwJfw4UGStKiNehXT+7tnUT8M/DzwX6tq52zbJLkGeB2wv6pe3LW9G3gzg8eXAlxeVdd3n10GXAI8CfxOVX1x/v85mlTrtnxh3CVImqc5AyLJMuCLVfVqYNZQOMxHgD/hyN9LXFlV7z9sH2cAFwK/CPws8KUkP1dVT85jf5KkY2jOKabuL+nHkvz0fL64qr4KfH/E7puAa6vqYFV9D9gDbJzP/iRJx9aov6T+Z+COJDvprmQCqKrfOYp9vjXJG4FdwDuq6kFgFfD1oT4zXdsRkmwGNgOsXbv2KHYvSRrFqCepvwD8FwZ3cL116DVfVwEvBDYA+3jqQURp9K3WF1TV1qqarqrpqampoyhBkjSKWY8gkqytqvuqatux2FlVPTD03R8GPt+tzgBrhrquBvYei31Kko7OXEcQf31oIclnftydJVk5tPoG4M5ueQdwYZJnd7fwWA/c/OPuT5J09OY6BzE89fOC+Xxxkk8AZwOnJJkB3gWcnWQDg+mje4G3AFTVXUm2A3czeOb1pV7BJEnjNVdA1DMsz6mqLmo0Xz1L/yuAK+azD0lSf+YKiJcmeZjBkcRJ3TLdelXVT/VanSRpbGYNiKpatlCFSJImy3yeByFJWkIMCElSkwEhSWoyICRJTQaEJKnJgJAkNRkQkqQmA0KS1GRASJKaDAhJUpMBIUlqMiAkSU0GhCSpyYCQJDUZEJKkJgNCktRkQEiSmgwISVJTbwGR5Jok+5PcOdR2cpKdSe7p3p/ftSfJh5LsSXJ7kpf1VZckaTR9HkF8BDj3sLYtwA1VtR64oVsHeA2wvnttBq7qsS5J0gh6C4iq+irw/cOaNwHbuuVtwPlD7R+tga8DK5Ks7Ks2SdLcFvocxGlVtQ+gez+1a18F3D/Ub6ZrO0KSzUl2Jdl14MCBXouVpKVsUk5Sp9FWrY5VtbWqpqtqempqqueyJGnpWuiAeODQ1FH3vr9rnwHWDPVbDexd4NokSUNOWOD97QAuBt7bvV831P7WJNcCvwT84NBUlI6ddVu+MO4SJB1HeguIJJ8AzgZOSTIDvItBMGxPcglwH3BB1/164DxgD/AY8Ka+6pIkjaa3gKiqi57ho1c1+hZwaV+1SOMwriO2e9/72rHsV4vPpJykliRNGANCktRkQEiSmgwISVKTASFJajIgJElNBoQkqcmAkCQ1GRCSpCYDQpLUtNA365PUM2/xoWPFIwhJUpMBIUlqMiAkSU0GhCSpyYCQJDUZEJKkJgNCktRkQEiSmgwISVLTWH5JneRe4BHgSeCJqppOcjLwSWAdcC/wm1X14DjqkySN9wjil6tqQ1VNd+tbgBuqaj1wQ7cuSRqTSZpi2gRs65a3AeePsRZJWvLGFRAF/O8ktybZ3LWdVlX7ALr3U8dUmySJ8d3N9RVVtTfJqcDOJN8edcMuUDYDrF27tq/6JGnJG8sRRFXt7d73A58DNgIPJFkJ0L3vf4Ztt1bVdFVNT01NLVTJkrTkLHhAJPnJJM87tAz8KnAnsAO4uOt2MXDdQtcmSXrKOKaYTgM+l+TQ/v+qqv5XkluA7UkuAe4DLuiziHE9VAV8sIqk48OCB0RVfRd4aaP9H4FXLXQ9kqS2SbrMVZI0QQwISVKTASFJajIgJElNBoQkqWlcv6SWtMh46fji4xGEJKnJgJAkNRkQkqQmz0GMwTjnaiVpVB5BSJKaDAhJUpMBIUlqMiAkSU0GhCSpyauYJB33xnVl4GL/BbdHEJKkJgNCktRkQEiSmgwISVLTxAVEknOTfCfJniRbxl2PJC1VE3UVU5JlwP8AfgWYAW5JsqOq7h5vZZJ0pMX+DIxJO4LYCOypqu9W1ePAtcCmMdckSUvSpAXEKuD+ofWZrk2StMAmaooJSKOtntYh2Qxs7lYfTfKdY7j/U4B/OIbftxg4JkdyTJ7O8ThS72OS9/1Ym//LUTpNWkDMAGuG1lcDe4c7VNVWYGsfO0+yq6qm+/ju45VjciTH5OkcjyMtljGZtCmmW4D1SU5P8hPAhcCOMdckSUvSRB1BVNUTSd4KfBFYBlxTVXeNuSxJWpImKiAAqup64Pox7b6XqavjnGNyJMfk6RyPIy2KMUlVzd1LkrTkTNo5CEnShDAg8PYeAEmuSbI/yZ1DbScn2Znknu79+eOscaElWZPkxiS7k9yV5G1d+5IdlyQnJrk5ybe6MXlP1356kpu6Mflkd5HJkpJkWZJvJvl8t37cj8mSD4ih23u8BjgDuCjJGeOtaiw+Apx7WNsW4IaqWg/c0K0vJU8A76iqFwFnAZd2/28s5XE5CJxTVS8FNgDnJjkLeB9wZTcmDwKXjLHGcXkbsHto/bgfkyUfEHh7DwCq6qvA9w9r3gRs65a3AecvaFFjVlX7quob3fIjDP7wr2IJj0sNPNqtLu9eBZwDfLprX1JjApBkNfBa4M+79bAIxsSA8PYeszmtqvbB4C9L4NQx1zM2SdYBZwI3scTHpZtKuQ3YD+wE/g54qKqe6LosxT9DHwT+APhRt/4zLIIxMSBGuL2HlrYkzwU+A7y9qh4edz3jVlVPVtUGBnc62Ai8qNVtYasanySvA/ZX1a3DzY2ux92YTNzvIMZgztt7LGEPJFlZVfuSrGTwL8YlJclyBuHw8ar6bNe85McFoKoeSvIVBudnViQ5ofsX81L7M/QK4PVJzgNOBH6KwRHFcT8mHkF4e4/Z7AAu7pYvBq4bYy0LrptHvhrYXVUfGPpoyY5LkqkkK7rlk4BXMzg3cyPwG123JTUmVXVZVa2uqnUM/v74clX9FotgTPyhHNAl/wd56vYeV4y5pAWX5BPA2QzuQvkA8C7gr4HtwFrgPuCCqjr8RPaileSVwP8B7uCpueXLGZyHWJLjkuQlDE64LmPwD8ztVfXfkryAwQUeJwPfBP5DVR0cX6XjkeRs4Peq6nWLYUwMCElSk1NMkqQmA0KS1GRASJKaDAhJUpMBIUlqMiAkSU0GhCSpyYCQJDX9fyyZfpI/EKGIAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1807a909f60>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from scipy.stats import boxcox\n",
    "new_train_df['age_boxcox']=boxcox(train_df['Age'])[0]\n",
    "new_test_df['age_boxcox']=boxcox(test_df['Age'],lmbda=boxcox(train_df['Age'])[1])[0]\n",
    "new_train_df['age_boxcox'].plot(kind='hist')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 二.模型选择\n",
    "\n",
    "首先，我们需要明白我们的建模的目标是什么？然后怎么去量化我们的目标？  \n",
    "\n",
    "\n",
    "（1）**目标**：本数据集是预测乘客是否存活，所以可以看做是分类任务；  \n",
    "\n",
    "（2）**量化目标**：选择合适的评估指标，这里我们可以选择f1；  \n",
    "\n",
    "当然，从众多分类模型中选择一个较优的模型作为基准模型，是一个比较繁琐的工作，一般来说用的比较多是RF、LR、GBDT等，通常为了让模型的评估更加客观，我们需要选择交叉验证的方式  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.7509734792424432, 0.01775967622151036)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier=LogisticRegression()\n",
    "scores = cross_val_score(classifier, train_df, label, scoring='f1', cv = 5)#注意：f1只是看正样本的f1,如果要看整体的用f1_macro,但这一般会使得f1偏高\n",
    "np.mean(scores),np.std(scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.7753721187015974, 0.047352448054968535)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#我们再看看另一种分类器\n",
    "classifier=GradientBoostingClassifier()\n",
    "scores = cross_val_score(classifier, train_df, label, scoring='f1', cv = 5)\n",
    "np.mean(scores),np.std(scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 偏差/方差\n",
    "我们查看了f1的均值与标准差,其中，均值反映模型的预测能力，标准差可以反映模型的稳定性，按照术语来说前者反映了模型的偏差，后面反映了模型的方差，我们可以将模型的能力用类似于如下的图来定义，显然我们倾向于选择低偏差低方差的模型：  \n",
    "![avatar](./source/方差与偏差.png)\n",
    "来源:https://blog.csdn.net/hertzcat/article/details/80035330"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 过拟合/欠拟合\n",
    "过拟合/欠拟合也是检验模型好坏的一种评估手段，它通常应用在训练阶段：  \n",
    "\n",
    ">（1）训练集/验证集效果都比较差，可以看作是欠拟合(除非训练数据真的是太差了)，这时可以增加模型的复杂度试一试；  \n",
    "\n",
    ">（2）训练集的表现好，而验证集的表现差，大部分情况下是过拟合（这也是经常会遇到的问题）可以尝试一下的方法；     \n",
    "\n",
    "过拟合的解决方法：   \n",
    "\n",
    "1). 降低模型复杂度：1.换更简单的模型，2.正则化技术    \n",
    "\n",
    "    \n",
    "2). 增强训练数据 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**补充一点：**   \n",
    "\n",
    "训练集上效果好，验证集上效果差，也有可能是数据分布差异大造成的影响，如果由于客观原因使得训练集和验证集不能从同一分布做随机采样（比如与时间相关的业务，只能使用T+1的数据做验证集），这时可以从训练集中分一小部分数据出来不做训练，然后对比结果：   \n",
    "\n",
    "（1）如果未训练的样本效果与训练集一致，那么可能是数据分布差异带来的影响；   \n",
    "\n",
    "（2）如果未训练的样本效果与验证集一致，那么过拟合的可能性更大；  \n",
    "\n",
    "![avatar](./source/过拟合与数据差异进行对比.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
